Abstract

This  paper  describes  an  implementation 
of  an  attentional  system  for  a  humanoid 
robot  based  completely  on  space  variant 
vision  (in  particular  log-polar).  The  aim 
is  that  of  providing  the  robot  with  a 
suitable  measure  of  position,  speed  and 
saliency  of  possibly  interesting  objects 
for  saccading  and  tracking.  The  major 
advantage  of  log-polar  based  imaging  is 
related to the reduced number of pixels
while  maintaining  a  large  field  of  view. 
This  arrangement  is  very  well  suited  for 
motor  control,  where  the  high-resolution 
center  (fovea)  allows  precise  positioning 
and, at the same time, the coarse
resolution  periphery  permits  detection  of 
potential  targets.  Algorithms  for  color 
processing,  optic  flow,  and  disparity 
computation  were  developed  within  this 
architecture.  The  attentional  modules  are 
intended  as  the  first  layer  of  a  more 
complicated  system,  which  shall  include 
learning  of  object  recognition,  trajectory 
tracking,  and  naive  physics 
understanding  during  the  natural 
interaction  of  the  robot  with  the 
environment. 

Introduction 

Besides  the  studies  on  artificial  neural 
networks,  substantial  effort  is  devoted 
worldwide  to  build  physical  models  of 
parts  of  biological  systems  with  the  aim 
of suggesting new solutions to robotics
but more importantly with the ultimate
goal of gaining a better understanding of
how the brain of living systems solves
the same sort of problems. Examples of
this approach can be found in [1-6].
Although  the  “robotic”  models  are  thus 
far  only  crude  approximations  of  real 
living  organisms,  the  motivations  of  the 
approach  are  rooted  in  the  belief  that 
“constructing”  a  real  system  might 
reveal  problems  and  subtleties  that  a 
mere  analysis  could  not. 

Along  this  line  of  research  we  developed 
an  attentional  system  for  a  humanoid 
robot.  The  peculiarity  of  this  work  is  in 
the  use  of  space  variant  vision.  In 
particular  we  employed  log-polar 
images, which, as described later on,
model  accurately  how  photoreceptors  are 
distributed  in  our  retinas.  Although,  on  a 
first  inspection,  it  might  seem  that  space 
variance  poses  more  challenges  than 
traditional  rectangular  imaging,  we  will 
show  that  very  simple  strategies  might 
be  employed  to  adapt  the  algorithms  to 
the  log-polar  geometry.  On  the  other 
hand,  we  gain  (from  the  space  variant 
sampling)  the  possibility  to  maintain  a 
large  field  of  view  and  at  the  same  time 
process  a  limited  number  of  pixels.  This 
is advantageous since it allows the system
to  be  simultaneously  maximally 
responsive  to  new  events  and  maximally 
precise  in  its  movements  (highest 
resolution  in  the  image  center).  The 
robot  exploits  color,  motion,  and 
binocular  cues  to  derive  information 
about potentially interesting targets for
pursuit  and  saccading.  Other  interesting 
aspects,  borrowed  from  biology,  concern 
the  use  of  inertial  information  to 
stabilize  the  visual  world  in  spite  of  the 
movement  of  the  robot  or  external 
disturbances. 

The experimental setup is a seven
degree-of-freedom robot head, with
human-like performance in terms of
speed and acceleration (see Figure 1).
For  the  scope  of  this  paper,  the  sensory 
system  consists  of  a  pair  of  cameras 
(standard  CCD;  sub-sampling  to  log- 
polar  is  carried  out  in  software),  an 
inertial  sensor  (InterSense  IS300)  which 
measures  the  roll,  pitch,  and  yaw  angles, 
and  high  resolution  motor  encoders 
providing  the  position  of  each  joint. 
Visual  processing  and  control  are  carried 
out  by  a  set  of  PCs  connected  through  a 
fast  network  and  running  a  real-time  OS. 
Video  signals  are  synchronized,  split, 
and  sent  in  parallel  to  many  nodes  for 
parallel  processing.  Nine  Pentium  class 
processors  are  employed  in  the  present 
implementation. 


Figure 1 The robot setup “Lazio”. The robot
head mounts four cameras (only two are used at
the moment). It has vergence and three
independent tilt joints, a pan at the level of the
neck and a roll. One of the tilt movements and
the roll movement are obtained by means of a
differential joint.


Log-polar  vision 


Among  the  many  possible  space  variant 
sub-sampling  procedures  the  one  we 
used is known as log-polar. The log-polar
mapping closely resembles the
distribution of the photoreceptors in the
primate retina as well as the
geometrical  transform  following  the 
projection  of  these  neurons  into  the 
primary  visual  cortex  [7-10].  The  initial 
analytical  formulation  based  on  studies 
on  the  primates’  visual  pathways  is  due 
mainly  to  Schwartz  [11];  his  model  can 
be  roughly  summarized  as  follows: 

•  The  distribution  of  the 
photoreceptors  in  the  retina  is  not 
uniform. They lie more densely in
the  central  region  called  fovea,  while 
they  are  sparser  in  the  periphery. 
Consequently,  the  resolution  also 
decreases  moving  away  from  the 
fovea  toward  the  periphery.  It  has  a 
radial  symmetry,  which  can  be 
approximated  by  a  polar  distribution. 

•  The  projection  of  the  photoreceptors 
array  into  the  primary  visual  cortex 
can  be  well  described  by  a  log-polar 
distribution  mapped  onto  an  almost 
rectangular  surface  (the  cortex). 

From the mathematical point of view,
the log-polar mapping can be expressed
as a transformation between a polar
plane (ρ, θ) (retinal plane) and a
Cartesian plane (ξ, η) (cortical plane), as
follows:

\[
\begin{cases}
\eta = q\,\theta \\
\xi = k_\xi \log\dfrac{\rho}{\rho_0}
\end{cases}
\qquad (0.1)
\]


where ρ0 is the radius of the innermost
circle, 1/q is the minimum angular
resolution of the log-polar layout, and
(ρ, θ) are the polar coordinates. k_ξ is a
linear scaling parameter: it has been
added to the original formulation in
order to fit the mapping into a fixed-size
square image (which is determined by
the frame grabber characteristics). (ρ, θ)
are related to the conventional Cartesian
reference system by:

\[
\begin{cases}
x = \rho\cos\theta \\
y = \rho\sin\theta
\end{cases}
\qquad (0.2)
\]
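
As an illustration, the mapping and its inverse can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used on the robot: it assumes the mapping has the form η = qθ, ξ = k_ξ log(ρ/ρ0) with the natural logarithm, and the function names are hypothetical.

```python
import math

def cartesian_to_logpolar(x, y, rho0, q, k_xi):
    """Map a retinal point (x, y) to cortical coordinates (xi, eta).

    Assumed form: eta = q * theta, xi = k_xi * log(rho / rho0).
    Points inside the innermost circle (rho < rho0) are not mapped.
    """
    rho = math.hypot(x, y)
    if rho < rho0:
        return None
    theta = math.atan2(y, x) % (2.0 * math.pi)  # angle in [0, 2*pi)
    return k_xi * math.log(rho / rho0), q * theta

def logpolar_to_cartesian(xi, eta, rho0, q, k_xi):
    """Inverse mapping: cortical (xi, eta) back to retinal (x, y)."""
    rho = rho0 * math.exp(xi / k_xi)
    theta = eta / q
    return rho * math.cos(theta), rho * math.sin(theta)
```

With 64 angular samples, for example, one could set q = 64/(2π) so that η spans the interval [0, 64).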


A pictorial example is shown in Figure
2, where the leftmost panel (a) shows a
Cartesian or retinal image (before
sub-sampling) and the corresponding
log-polar (or cortical) image on the right (b).


Figure 2 An example of log-polar mapping.
Note how radial structures in the flower (petals)
map to horizontal structures in the log-polar
image. Circles, on the other hand, map to
vertical patterns.


Optic  flow 


Optic flow, as described by Horn, is “the
apparent motion of pixels in the image
plane” [12]. Horn also proposed a
continuity constraint for the optic flow
involving the spatio-temporal derivatives
of the image intensity:


\[
\frac{\partial I}{\partial x}\,u + \frac{\partial I}{\partial y}\,v + \frac{\partial I}{\partial t} = 0
\qquad (0.3)
\]


where  I  is  the  image  intensity  and  u,v  the 
flow  components.  The  two  components 
cannot  be  recovered  from  equation  (0.3) 
alone.  If  we  assume  that  the  flow  is  well 
represented  by  an  affine  model  such  as: 
then a combination of equation (0.3) and
(0.4) allows computing the parameters of
the affine model, provided the constraint
is evaluated in at least six points; usually a
least-squares approach is taken and more
points are used. The parameters have the
meaning of translation, divergence, curl,
and shear. The approach is similar to that
proposed in [13]. To take into account
the log-polar geometry, equation (0.3)
has to be further transformed into:

-7-  =  [r,  r,  st  g,  r,  r.  ]  k  \drs,s,]  (0.5) 

dt 

where the constants γ and g represent
the matrix product of the image
derivative with the log-polar Jacobian;
for a complete derivation see [6].
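
The least-squares step can be illustrated in Cartesian coordinates. The sketch below assumes a generic six-parameter affine model, u = u0 + a·x + b·y and v = v0 + c·x + d·y; this parameterization, the function names, and the plain Gaussian elimination are illustrative assumptions and not necessarily the form adopted in equation (0.4). Each image point with derivatives (Ix, Iy, It) contributes one linear equation, obtained by substituting the model into equation (0.3).

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few independent points")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_affine_flow(samples):
    """Least-squares affine flow from (x, y, Ix, Iy, It) samples.

    Returns (u0, v0, a, b, c, d) for the assumed model
    u = u0 + a*x + b*y, v = v0 + c*x + d*y.
    """
    AtA = [[0.0] * 6 for _ in range(6)]
    Atb = [0.0] * 6
    for x, y, Ix, Iy, It in samples:
        # Ix*u + Iy*v = -It, written as row . p = -It
        row = [Ix, Iy, Ix * x, Ix * y, Iy * x, Iy * y]
        for i in range(6):
            Atb[i] += row[i] * (-It)
            for j in range(6):
                AtA[i][j] += row[i] * row[j]
    return solve_linear(AtA, Atb)
```

Under this parameterization the divergence is a + d and the curl is c − b; the two shear components are the remaining combinations of the first-order terms.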

Further,  by  processing  the  optic  flow  we 
can  determine  which  parts  of  the  image 
are  moving,  and  consequently  segment 
the  target  from  the  background  (in  those 
cases  when  they  are  moving  differently). 
Roughly  speaking,  this  is  accomplished 
by  computing  the  expected  optic  flow 
due  to  the  movement  of  the  camera  and 
subtracting  it  from  the  actual  optic  flow. 
Where  the  two  differ  enough  (by  a 
suitable  measure)  the  pixels  could  be 
identified  and  labeled  as  an  independent 
moving  object. 

The  expected  flow  is  determined  by 
using  a  constant  approximation  of  the 
image  Jacobian: 


\[
\begin{bmatrix} u_0 \\ v_0 \end{bmatrix} = J\,\dot{q}
\qquad (0.6)
\]

The matrix J is estimated by incremental
least squares, by collecting example
pairs of the joint speed q̇ and the
measured optic flow. An appropriate
delay line takes care of synchronizing
the two signals.
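
A minimal sketch of such an incremental estimate is given below, using a standard recursive least-squares update. The class name, the initial covariance p0, and the assumption that joint speeds and flow samples have already been aligned in time (the role of the delay line) are made for the purpose of illustration only.

```python
class IncrementalJacobian:
    """Recursive least-squares estimate of J in flow = J * qdot."""

    def __init__(self, n_joints, n_outputs=2, p0=1e3):
        self.n = n_joints
        self.J = [[0.0] * n_joints for _ in range(n_outputs)]
        # P approximates the inverse of the accumulated qdot * qdot^T matrix.
        self.P = [[p0 if i == j else 0.0 for j in range(n_joints)]
                  for i in range(n_joints)]

    def update(self, qdot, flow):
        """Refine J with one (joint speed, measured flow) example pair."""
        n = self.n
        Pq = [sum(self.P[i][j] * qdot[j] for j in range(n)) for i in range(n)]
        denom = 1.0 + sum(qdot[i] * Pq[i] for i in range(n))
        gain = [v / denom for v in Pq]
        for r, row in enumerate(self.J):
            err = flow[r] - sum(row[j] * qdot[j] for j in range(n))
            for j in range(n):
                row[j] += err * gain[j]
        # P <- P - gain * (P qdot)^T, valid because P is symmetric.
        for i in range(n):
            for j in range(n):
                self.P[i][j] -= gain[i] * Pq[j]
```

The expected flow for a new joint speed is then obtained by multiplying the current estimate of J by q̇, as in equation (0.6).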

The actual segmentation algorithm in
this case builds on the Horn equation.
It suffices to note that equation (0.3) is
satisfied when the flow vectors are in
“agreement” with the spatio-temporal
gradient of I. Where the equation is not
satisfied, on the other hand, the
expected flow is not correct for that
pixel. Consequently, by identifying the
regions where the expected flow causes:


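\[
\left| \frac{\partial I}{\partial x}\,u_0 + \frac{\partial I}{\partial y}\,v_0 + \frac{\partial I}{\partial t} \right| > \varepsilon
\qquad (0.7)
\]

we can segment the ego-motion
component from the moving object(s).
ε in equation (0.7) is
a suitable threshold.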
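
The thresholding step can be sketched as follows. The sketch assumes a single expected flow vector (u0, v0) for the whole image and precomputed derivative images stored as nested lists; the function name and data layout are hypothetical.

```python
def segment_independent_motion(Ix, Iy, It, u0, v0, eps):
    """Flag pixels where the expected flow (u0, v0) violates the Horn constraint.

    Ix, Iy, It are equally sized 2-D lists holding the spatial and temporal
    derivatives of the image intensity; eps is a hand-chosen threshold.
    """
    mask = []
    for row_x, row_y, row_t in zip(Ix, Iy, It):
        mask.append([abs(ix * u0 + iy * v0 + it) > eps
                     for ix, iy, it in zip(row_x, row_y, row_t)])
    return mask
```

Pixels marked True are candidates for an independently moving object; as discussed next, strong edges and fast head movements can also produce large residuals.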

It  is  fair  to  say  that  various  reasons  not 
necessarily  related  to  the  presence  of  an 
object  might  cause  the  flow  not  to 
respect  the  Horn  constraint.  These 
include  the  presence  of  strong  edges 
(where  the  spatial  gradient  is  high)  or 
fast  movements  of  the  head,  for  which 
the  linear  prediction  model  is  prone  to 
failure. 


Color  processing 

Color  processing  comes  in  many  flavors 
within  our  attentional  system:  i)  a 
general-purpose  color  segmentation 
algorithm,  ii)  a  color  “blob”  detector, 
and  iii)  a  skin  tone  detector. 

The  general-purpose  segmentation  is 
based  on  histograms.  It  is  started  by  a 
motion  sensitive  cueing  procedure.  It 
subsequently  builds  a  pair  of  histograms: 
one  to  represent  the  target  (the  moving 
object),  and  a  second  that  contains  the 
information  about  the  background.  The 
latter is continuously adapted and thus
provides  a  sort  of  habituation  to  the 
color  of  the  background.  Histograms  are 
constructed  in  the  HSV  color  space;  they 
have  the  form: 

\[
\mathrm{histo}(H,S) = \frac{h(H,S)}{\sum_{H,S} h(H,S)}
\qquad (0.8)
\]

i.e. they are independent of the image
intensity (V) and normalized to one. A
pixel is assumed to belong to the object
if (an estimate of) its probability,
computed as:

\[
p(\mathrm{object} \mid h, s) = \mathrm{histo}(h, s)
\qquad (0.9)
\]

is greater than a threshold and its
histogram does not overlap with that of
the background.
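
The following sketch illustrates the histogram construction and the pixel test. It assumes pixels already expressed as (h, s, v) triples in [0, 1], a 16x16 binning, and reads "no overlap" as an empty background bin; these choices, like the function names, are illustrative assumptions.

```python
def hs_bin(h, s, bins):
    """Index of the histogram cell for a hue-saturation pair in [0, 1]."""
    return min(int(h * bins), bins - 1), min(int(s * bins), bins - 1)

def hs_histogram(pixels, bins=16):
    """Hue-saturation histogram normalized to one; V is deliberately ignored."""
    counts = [[0.0] * bins for _ in range(bins)]
    for h, s, _v in pixels:
        i, j = hs_bin(h, s, bins)
        counts[i][j] += 1.0
    total = sum(map(sum, counts))
    if total == 0:
        return counts
    return [[c / total for c in row] for row in counts]

def is_object(pixel, target_hist, background_hist, threshold, bins=16):
    """True if the pixel colour is likely under the target histogram
    and absent from the background histogram."""
    h, s, _v = pixel
    i, j = hs_bin(h, s, bins)
    return target_hist[i][j] > threshold and background_hist[i][j] == 0.0
```

Because V is discarded, two pixels with the same hue and saturation are treated alike regardless of how brightly they are lit.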

The  blob  detector  is  based  on  a  very 
standard  region  growing  procedure. 
Areas  of  uniform  color,  as  measured  by 
taking  into  account  hue  and  saturation 
only,  are  labeled.  A  further  grouping  and 
coherency  test  of  the  resulting  regions  is 
performed  to  eliminate  spurious  results 
(very  likely  due  to  noise).  In  spite  of  its 
simplicity the algorithm provides a very
stable  behavior. 

Finally  a  skin  tone  detector  has  been 
implemented.  It  is  based  on  the 
algorithm  developed  in  [14].  It  has  been 
found  to  improve  the  robot’s  ability  to 
interact  with  humans,  although  it  is  not 
sufficient,  for  example,  to 
unambiguously  detect  faces. 

Disparity  computation 

Binocular  disparity  is  the  strongest  cue 
related  to  depth.  In  the  context  of 
sensori-motor  coordination  it  can  be 
used  to  control  vergence.  A  suitable 
procedure  to  estimate  the  disparity  of  a 
target  (or  of  a  particular  region  of  the 
image)  is  that  of  using  cross-correlation 
(or  another  suitable  distance  measure)  to 
find  the  difference  in  position  between 
the  left  and  right  image  of  that  particular 
region  (representative  of  the  target). 

This procedure could be implemented by
an exhaustive search. In formula:

\[
\hat{d} = \arg\max_{d}\; f_U\big(I_L(x,y),\, I_R(x+d,y)\big)
\qquad (0.10)
\]

where the function f is the pixel
similarity measure, U the support of f,
and I_L, I_R the left and right image
respectively.
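
A Cartesian version of the exhaustive search can be sketched as follows, with normalized cross-correlation as the similarity measure and the whole overlapping image area as support. Images are plain nested lists of gray levels; the function names and the search range max_d are hypothetical.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den if den > 0 else 0.0

def best_disparity(left, right, max_d):
    """Return the shift d in [-max_d, max_d] maximizing
    the correlation between I_L(x, y) and I_R(x + d, y)."""
    width = len(left[0])
    best_d, best_score = 0, -2.0
    for d in range(-max_d, max_d + 1):
        # Columns x for which both x and x + d fall inside the image.
        lo, hi = max(0, -d), min(width, width - d)
        a = [row[x] for row in left for x in range(lo, hi)]
        b = [row[x + d] for row in right for x in range(lo, hi)]
        score = ncc(a, b)
        if score > best_score:
            best_d, best_score = d, score
    return best_d
```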

While this is fine in Cartesian
coordinates, it needs to be modified in
the log-polar domain in order to take
into account how pixels shift under a
Cartesian translation d:

\[
I_L(\xi,\eta) = I_R\big(\mathrm{lp}_d(\xi,\eta)\big)
\qquad (0.11)
\]

It is easy to verify that the
transformation lp_d itself does not depend
on the actual images and thus can be
computed beforehand [15].

In our case we chose the normalized
cross-correlation as f in equation (0.10).
As a consequence of the log-polar
mapping an explicit segmentation is not
necessary, and in fact U was chosen to be
the whole image. The disparity is
that of the target as long as it remains
close to the foveae (since most of the
pixels then belong to the target);
otherwise the value of disparity
switches to that of the background.

It  is  worth  mentioning  that  the  current 
implementation  assumes  that  the 
transform  from  the  left  to  the  right  image 
is  a  pure  translation  along  the  horizontal 
direction.  This  in  reality  is  unlikely  to  be 
the  case.  Disparity  in  fact  is  strictly 
horizontal  only  when  the  optical  axes  are 
parallel  (and  vertically  aligned).  A 
further  limitation  might  arise  because  of 
the  distortion  of  the  lenses.  In  this  case 
too  the  “pure  translation”  assumption 
would  fail.  This  was  not  the  case  in  our 
configuration  though. 


Figure 3 An example of visual processing and saliency map. Basic processing modules are combined
to produce the retinal error and the disparity signals. Each separate processing module provides a
list of probable targets and the corresponding bounding boxes. A voting mechanism is used to build the saliency
map. The position of the maximum of the saliency map is the retinal error.


Integration 

In  order  to  provide  the  controller  with  a 
reasonable  reference  value  the  results 
provided  by  the  various  algorithms  have 
to  be  integrated  in  a  single  percept. 


The  two  quantities  relevant  to  the  control 
of  gaze  direction  are  the  position  of  the 
target  in  the  left-right,  up-down 
directions  and  the  depth  with  respect  to 
the  fixation  point.  These  two  quantities 
are  related  to  two  different  control 
modes: version (same control values to
both eyes) and vergence (opposite
commands to the eyes). The first
quantity,  apt  to  control  version,  is 
estimated  by  a  voting  mechanism.  Each 
algorithm  provides  a  list  of  potential 
targets  and  their  bounding  boxes  in 
retinal  coordinates,  and  consequently 
increases  the  saliency  of  the  regions 
identified  by  the  bounding  boxes.  The 
increment of saliency might be weighted
to  give  more  importance  to  particular 
aspects  of  the  targets  (e.g.  skin  tone 
versus  motion).  The  position  of  the 
maximum  of  the  saliency  function 
determines  what  is  tracked. 
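
The voting mechanism can be sketched as follows. The cue names, the box convention (end-exclusive row and column ranges), and the default weight of one are assumptions introduced for the example.

```python
def build_saliency(shape, detections, weights):
    """Accumulate weighted votes from every cue.

    detections maps a cue name to a list of boxes (row0, col0, row1, col1),
    end-exclusive, in retinal coordinates; weights maps a cue name to the
    increment of saliency contributed by each of its boxes.
    """
    rows, cols = shape
    saliency = [[0.0] * cols for _ in range(rows)]
    for cue, boxes in detections.items():
        w = weights.get(cue, 1.0)
        for r0, c0, r1, c1 in boxes:
            for r in range(max(r0, 0), min(r1, rows)):
                for c in range(max(c0, 0), min(c1, cols)):
                    saliency[r][c] += w
    return saliency

def most_salient(saliency):
    """Return (row, col) of the first maximum of the saliency map."""
    best, pos = float("-inf"), (0, 0)
    for r, row in enumerate(saliency):
        for c, v in enumerate(row):
            if v > best:
                best, pos = v, (r, c)
    return pos
```

Regions supported by several cues accumulate more votes, so a location that is both moving and skin-colored outranks one flagged by a single module.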

Although  it  is  not  a  concern  here,  it  is 
worth  noting  that  the  weights  and  shape 
of  the  attentional  regions  can  be 
modified  on-the-fly  to  give  more  or  less 
importance  to  different  aspects  of  the 
observed  scene,  and  this  can  be  carried 
out  in  relation  to  the  task  or  internal 
status  of  the  robot  [16]. 


Control 

Control  is  mostly  constructed  around  the 
two  quantities  described  in  the  previous 
section.  The  controller  can  be  further 
divided  into  two  sub-modules  dealing 
respectively  with  gaze  stabilization  and 
saccadic  behaviors  (gaze  shifting).  This 
roughly  reflects  two  distinct  functional 
modes  of  the  controller  itself.  Gaze 
stabilization  is  obtained  by  means  of 
closed  loop  controllers  (i.e.  PID),  while 
saccades  are  open  loop. 


The  control  of  the  eyes 

As far as the stabilization of gaze is
concerned, the eyes are essentially
controlled in order to zero the retinal error:

\[
\begin{bmatrix} \dot{q}_1 \\ \dot{q}_2 \\ \dot{q}_3 \end{bmatrix} = \mathrm{PID}(e_x, e_y)
\qquad (0.12)
\]

where q̇1, q̇2, q̇3 are the speeds of the
joints (eye pan and common tilt) and e_x, e_y
the retinal error. This particular module
does not change the vergence angle
(which is adjusted by another control
loop instead) and consequently q̇2 = q̇3.

A  word  of  caution  is  necessary:  the 
controller  assumes  the  dynamics  of  the 
system  is  negligible.  This  is  only 
approximately  true.  While  stability  is  not 
compromised  (the  control  loop  can  be 
shown  to  be  stable  by  applying  for 
example  the  visual  servoing  theory  [17]), 
the performance could nevertheless be
affected [18]. In our case though, the
inertias  involved  are  very  small 
compared  to  the  low-level  PID  gain  and 
thus  the  system's  dynamics  is  truly 
negligible. 

Inertial  stabilization 

Yet  gaze  stabilization  can  be  obtained 
through  other  means.  The  general  idea, 
borrowed  from  the  biological  vestibular 
stabilization  mechanisms  (the  vestibulo- 
ocular  reflex  see  for  example  [19]),  is 
that  of  using  inertial  sensing.  In  our  case, 
three  gyroscopes  are  employed  arranged 
along  three  orthogonal  directions.  A 
simple  controller  can  be  formulated  as 
follows: 


\[
\begin{bmatrix} \dot{q}_{\mathrm{pan}} \\ \dot{q}_{\mathrm{tilt}} \end{bmatrix}
=
\begin{bmatrix} -k_1 & 0 \\ 0 & -k_2 \end{bmatrix}
\begin{bmatrix} \omega_{\mathrm{yaw}} \\ \omega_{\mathrm{pitch}} \end{bmatrix}
\qquad (0.13)
\]

with k_1 and k_2 two suitable gains, and
ω_yaw, ω_pitch the angular velocities
measured by the gyros along the yaw
and pitch directions; the left-hand side
collects the velocity commands of the eye
pan and tilt joints. The intuitive
description of the controller is that of
counter-rotating the eyes in order to
compensate for the movement of the
head or body of the robot. A further loop
exploits the roll degree of freedom to
maintain the eyes approximately aligned
with respect to gravity.
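
In code, the counter-rotation amounts to a sign-inverted proportional law. The sketch below assumes that yaw is compensated by the eye pan and pitch by the eye tilt, and it sums the result with the command of another loop in the way Figure 4 describes for gaze stabilization; function names and default gains are hypothetical.

```python
def inertial_command(omega_yaw, omega_pitch, k1=1.0, k2=1.0):
    """Counter-rotate the eyes: pan opposes yaw, tilt opposes pitch."""
    return -k1 * omega_yaw, -k2 * omega_pitch

def combine(*commands):
    """Sum (pan, tilt) velocity commands coming from independent loops."""
    return sum(c[0] for c in commands), sum(c[1] for c in commands)
```

For example, combine(inertial_command(w_yaw, w_pitch), visual_command) would add the inertial term to a visually derived velocity command, where visual_command stands for the output of the retinal-error loop.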

A  more  sophisticated  control  schema 
(optimal  in  the  sense  of  image 
stabilization)  has  been  investigated  in 
[20]  together  with  an  on-line  learning 
strategy. 

Head  control 

The  goal  of  the  head  movements  is 
simply  that  of  repositioning  the  head 
after  the  eyes  have  lost  their  “central” 
position  (symmetric  vergence). 
Essentially,  the  controller  drives  the 
head  joints  in  order  to  zero  the  deviation 
from  the  symmetric  configuration,  or,  in 
the  case  of  the  tilt,  from  a  resting 
configuration  with  the  joints  aligned.  For 
example,  for  the  pan  at  the  level  of  the 
neck  the  controller  is: 

\[
\dot{q}_6 = \mathrm{PID}(q_2 - q_3)
\qquad (0.14)
\]

This  strategy  alone  would  very  likely 
oscillate  (or  otherwise  the  movements 
must  be  kept  very  slow)  because  the 
head  movements  would  disturb  the 
movement  of  the  eyes.  A  possible 
solution  is  that  of  compensating  the 
movement  of  the  head  by  a  counter 
rotation  of  the  eyes.  This  is  exactly  the 
inertial  stabilization  mechanism  already 
described.  There  is  evidence  that  a 
similar  mechanism  is  employed  by 
humans  to  coordinate  the  head  and  eyes 
[19].  This  strategy  is  also  efficient  in  the 
sense  that  it  maximizes  the  range  of 
movement  of  the  eyes  by  maintaining 
them  most  of  the  time  far  from  limit 
configurations. 

The  control  of  vergence 

Vergence control is provided by a
completely separate loop. Note also
that the disparity measurement process is
separate; this reduces the chances of a
conflict between the pursuit and the
“verge” behavior. The controller in this
case tries to keep the disparity d close to
zero. Vergence, together with the control
of tracking, ensures that the object of
interest is kept almost in the foveae (left
and right eyes).

Saccades 

Saccades  neatly  complement  the  pursuit 
controllers  and  gaze  stabilization  when 
the  object  moves  too  fast  to  be 
appropriately  followed,  or,  on  the  other 
hand  when  a  rapid  shift  of  attention  is 
required  (because  a  new  more  salient 
target  appeared).  When  performing  a 
saccade,  the  gaze  stabilization  behaviors 
get  temporarily  inhibited.  The  precise 
computation  of  saccades  would  require 
the  knowledge  of  the  mapping  between 
the  retinal  error  and  the  appropriate 
motor  command  (a  learning  strategy  has 
been  investigated  in  [21]). 

In  our  case  we  resorted  to  a  simpler 
implementation  by  using  a  linear  map, 
which  has  been  tuned  by  hand  (it  is 
substantially  only  a  gain  matrix).  A  final 
note concerns the actual activation of
saccades: the logic behind their
generation checks both whether the
target is outside the fovea (defined by a
threshold) and whether a refractory
period has elapsed. The latter is required to
stabilize the system. In fact, saccades,
acting  as  a  very  high-gain  controller, 
might  lead  to  unstable  behaviors. 
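
The activation logic can be sketched as follows. The class name, the circular definition of the fovea, the time units, and the 2x2 gain matrix layout are assumptions made for the example; the gain values themselves would be tuned by hand as described above.

```python
import math

class SaccadeTrigger:
    """Decide when to fire an open-loop saccade (illustrative logic only)."""

    def __init__(self, fovea_radius, refractory_period, gain):
        self.fovea_radius = fovea_radius            # threshold defining the fovea
        self.refractory_period = refractory_period  # minimum time between saccades
        self.gain = gain                            # 2x2 hand-tuned gain matrix
        self.last_saccade = -math.inf

    def step(self, t, ex, ey):
        """Return a motor command if a saccade fires at time t, else None."""
        outside_fovea = math.hypot(ex, ey) > self.fovea_radius
        rested = (t - self.last_saccade) >= self.refractory_period
        if not (outside_fovea and rested):
            return None
        self.last_saccade = t
        g = self.gain
        # Linear map from retinal error to motor command.
        return (g[0][0] * ex + g[0][1] * ey, g[1][0] * ex + g[1][1] * ey)
```

While a saccade is being executed, the caller would also inhibit the gaze stabilization loops, as noted earlier in this section.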

Figure  4  below  is  intended  to  give  the 
general  flavor  of  how  the  different 
control  loops  are  combined  and 
organized  (see  caption  for  details). 
Figure  5  shows  an  example  trajectory:  it 
is  possible  to  note  the  activation  of  the 
two  control  modes  (open-  and  closed- 
loop). 


Figure 4 A schematic of the head controller. Different signals (top) are used to build different
independent control loops. Each module generates a velocity command. For gaze
stabilization, the velocity commands are eventually combined by summing them. Saccades are
independently calculated and activated only when needed by the saccade control logic. Finally,
velocity commands are sent to a low-level PID controller, which generates the appropriate signals to
drive the motors.


Figure  5  Robot  behavior.  The  plot  on  the  left  shows  the  horizontal  component  of  the  eye  movement 
command.  Note  the  two  peaks  due  to  the  generation  of  two  saccades.  The  plot  on  the  right  shows 
instead  the  position  of  the  fixation  point  in  2D  (the  height  is  not  shown)  corresponding  to  the  same 
movement  on  the  left. 


Conclusions 

This  paper  addressed  the  problem  of 
designing  and  realizing  a  biologically 
inspired  attentional  system  for  a 
humanoid  robot.  We  showed  through  the 
use  of  space  variant  vision  that  it  is 
possible  to  maximally  exploit  the 
available  computational  power  without 
compromising  the  ability  to  perform 
accurate movements. The benefits are
still moderate at the present resolution
(64x32 pixel log-polar images were
employed). The ratio between the number
of pixels of the corresponding Cartesian
image and that of the log-polar image
would grow even further with increasing
resolution. For instance, a 33000-pixel
log-polar image would correspond (under
certain hypotheses) to a million-pixel
rectangular image (assuming that the
foveal resolution is the same).

We  showed  also  how  optic  flow,  color 
and  depth  cues  could  be  estimated  from 
log-polar images. No less important, we
showed  how  a  simplified  coordination 
schema  of  head  and  eye  movements 
could  be  devised  under  the  hypothesis 
that  compensatory  eye  movement  can  be 
generated. 

It  is  important  to  note  that  the 
architecture  is  completely  bottom-up. 
We  are  aware  that  this  is  a  biologically 
implausible  simplification;  in  our  view 
this  has  to  be  considered  only  a  very  first 
visuo-motor  coordination  layer. 
Furthermore,  the  integration  mechanism 
is  not  tuned  on  the  basis  of  the  current 
state  of  the  robot  or  the  task  at  hand. 
However, this has already been
investigated in other contexts, for
example in [22], and it is likely to be
incorporated into this model as the
investigation proceeds.

Future work will include object
recognition abilities. In this context, the
multi-cue approach is extremely
effective in driving the exploration of the
environment, and thus in facilitating the
acquisition of training samples for
autonomous learning.